Main Idea
Dimensionality reduction is like taking a photo
- Reduce dimensions: 4D (?) -> 2D
- Make sure everybody is visible (keep maximum information)
- Getting the right “angle” is important
Module 3
University of South Florida
Starts with each point as a cluster by itself
Combine the nearest clusters
Keep combining until everything is merged into one cluster
Does not require specifying the number of clusters
The length of the dendrogram lines represents the distance between the combined clusters
Rule of thumb: draw a horizontal line that crosses “large” distances
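The steps above can be sketched with SciPy's hierarchical clustering tools; the two-group toy data below is made up for illustration and is not from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight, well-separated groups of 2D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),
               rng.normal(5, 0.1, (5, 2))])

# Agglomerative clustering: start with each point as its own cluster,
# then repeatedly merge the nearest pair (Ward linkage here)
Z = linkage(X, method="ward")

# "Draw a horizontal line" across the large distances: cutting the
# dendrogram at height 2 splits only the big merge, leaving two groups
labels = fcluster(Z, t=2, criterion="distance")
print(len(set(labels)))  # 2 clusters
```

Cutting at a different height would change the number of flat clusters; the rule of thumb is to cut where the merge distances jump.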
Density-based clustering works well for perceptual data:
A very popular algorithm in Machine Learning and Finance.
A statistical method to reduce dimensionality
Dimensionality reduction is like taking a photo
Reduced dimensions can make (huge) distortions when not done properly
Formally, PCA is:
A linear transformation of the original variables (p) into a new set of variables (q <= p) such that
The main use of PCA is dimensionality reduction
An orthogonal projection of data onto a lower-dimensional space (e.g., 2D -> 1D)
that minimizes the distance from the original points (red) to their projections (green)
that retains maximum variance between the projected data points
Note that the red dots are “distinct”:
To keep them distinguishable in the lower-dimensional space, they must be spread out as much as possible.
Conversely, if many dots are crammed together and overlap, we cannot distinguish them in the lower-dimensional space.
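The two criteria above pick the same direction: projecting onto the axis of maximum spread keeps the dots distinguishable. A NumPy sketch with a made-up correlated 2D cloud:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: points spread mostly along the 45-degree line
x = rng.normal(0, 1, 500)
X = np.column_stack([x, x + rng.normal(0, 0.3, 500)])
X -= X.mean(axis=0)

def projected_variance(X, direction):
    """Variance of the data after projecting onto a unit direction."""
    d = direction / np.linalg.norm(direction)
    return np.var(X @ d)

# Variance along the "spread" direction vs. the perpendicular one
v_along = projected_variance(X, np.array([1.0, 1.0]))
v_perp  = projected_variance(X, np.array([1.0, -1.0]))
print(v_along > v_perp)  # True: the spread direction preserves distinctness
```

Projecting onto the perpendicular direction crams the points together, which is exactly the distortion a badly chosen “angle” produces.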
An eigenvector gives the direction of a new axis
An eigenvalue is like the length along that eigenvector:
The information content (or importance)
It is the variance of the corresponding principal component
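This relationship — each eigenvalue equals the variance of its principal component — can be checked directly. A NumPy sketch on made-up correlated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
Z = np.column_stack([x, 0.8 * x + rng.normal(0, 0.5, 300)])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)  # z-score standardize

G = Z.T @ Z / (len(Z) - 1)   # covariance of standardized data
lam, V = np.linalg.eigh(G)   # eigenvalues (ascending) and eigenvectors
PC = Z @ V                   # principal components

# Each eigenvalue equals the variance of its principal component
print(np.allclose(lam, PC.var(axis=0, ddof=1)))  # True
```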
Generally faster
Works on the \(m \times n\) matrix directly
Scales well and is numerically stable because it does not require computing the covariance matrix
\[PC_{n \times p} = Z_{n \times p} V_{p \times p}\]
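The SVD and EVD routes agree: if \(Z = U \Sigma V^T\), then \(PC = ZV = U\Sigma\), and the eigenvalues of the covariance matrix are the squared singular values divided by \(n-1\). A NumPy check on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)

# EVD route: form the covariance matrix first
lam, V = np.linalg.eigh(Z.T @ Z / (len(Z) - 1))

# SVD route: works directly on the n x p matrix, no covariance needed
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Eigenvalues of G equal singular values squared / (n - 1)
print(np.allclose(sorted(lam), sorted(s**2 / (len(Z) - 1))))  # True
```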
Factor loadings are weighted eigenvectors. Needed for Interpretable Machine Learning.
\[L_{p \times p} = V_{p \times p} \sqrt{\Lambda_{p \times p}}\]
\(L_{ij}\) is the loading of variable \(i\) on PC \(j\).
Interpretation:
If you have a dataset with 1,000 observations and 200 variables:
Q1. How many principal components (PCs) will I have?
Q2. What does the PCA output look like?
Q3. How can it reduce the variables, then?
Q4. How many PCs should I choose?
Q5. Which PC is the most important?
Q6. What do you mean by most important?
Q7. Is that importance the same as the eigenvalue?
Q8. How many eigenvalues will it have?
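For the 1,000 × 200 case in Q1 and Q8, full PCA produces one PC and one eigenvalue per original variable. A quick NumPy check with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 200))   # 1,000 observations, 200 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

lam, V = np.linalg.eigh(Z.T @ Z / (len(Z) - 1))
PC = Z @ V

print(PC.shape)   # (1000, 200): as many PCs as original variables
print(lam.shape)  # (200,): one eigenvalue per PC
```

Dimensionality reduction then comes from keeping only the first q columns of PC, ranked by eigenvalue.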
Input:
A dataset with \(n\) observations and \(p\) features
\[ X_{n\times p} \]
Primary Output:
A dataset with \(n\) observations and \(q\) new features (\(q \leq p\))
\[ PC_{n \times q} \]
Extra output:
\[ \lambda_{1,2,3,...,q} \]
\[ L_{p \times q} \]
Factor loadings: how the new \(q\) principal components are constructed from the original variables
Steps
Usually ML packages handle all steps at once and show the output summary nicely.
The code below demonstrates the steps performed behind the scenes.
For simplicity, I’m intentionally making two variables that are highly correlated.
Create Y as a linearly correlated variable. This is to show how PCA captures the maximum variance when reducing dimensions.
Z-score standardization:
\[G_{p \times p} = \frac{1}{n-1} Z^T Z\]
Eigen Value Decomposition: \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
\[PC_{n \times p} = ZV \]
Eigenvalues represent the variance captured by each principal component. Calculate the proportion of variance explained:
Factor loadings help us understand how the original variables contribute to the principal components. \[L = V\sqrt{\Lambda}\]
[,1]
[1,] 1.365999
[2,] 1.414214
With 2-dimensional data:
PC1 captures the maximum variability of the data
PC2 captures the remaining variability
The more correlated the variables, the more effective the dimensionality reduction
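The manual steps above (the lecture demonstrates them in R) can be mirrored in a NumPy sketch; the correlated pair below is made up in the same spirit as the lecture's X/Y demo.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two deliberately correlated variables, as in the lecture's demo
x = rng.normal(size=200)
y = 2 * x + rng.normal(0, 0.5, 200)      # Y linearly related to X
X = np.column_stack([x, y])

# 1) Z-score standardization
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2) Covariance matrix G = Z'Z / (n - 1)
G = Z.T @ Z / (len(Z) - 1)

# 3) Eigenvalue decomposition G = V Lambda V'
lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]            # sort descending by eigenvalue
lam, V = lam[order], V[:, order]

# 4) Principal components PC = Z V
PC = Z @ V

# 5) Proportion of variance explained by each PC
pve = lam / lam.sum()

# 6) Factor loadings L = V sqrt(Lambda)
L = V @ np.diag(np.sqrt(lam))

print(pve)  # PC1 captures most of the variance of the correlated pair
```

Because X and Y are highly correlated, PC1 alone explains nearly all the variance — dropping PC2 loses little information.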
In this step, we perform Principal Component Analysis (PCA) using the H2O framework.
First, the dataset is converted to an H2O frame while excluding non-numeric columns (Country, Abbrev).
The PCA is performed using the h2o.prcomp() function:
k = 4: specifies the number of principal components to compute.
transform = "STANDARDIZE": standardizes (center and scale) all variables before applying PCA.
Real GDP growth (from IMF): the higher the better
Corruption Index (Transparency International): the higher the better (no corruption)
Peace Index (Institute for Economics and Peace): the lower the better (very peaceful)
Legal risk index (Property Rights Association): the higher the better (favorable)
# A tibble: 6 × 6
country abbrev corruption peace legal gdp_growth
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Albania AL 35 1.82 4.55 2.98
2 Algeria DZ 35 2.22 4.43 2.55
3 Argentina AR 45 1.99 5.09 -3.06
4 Armenia AM 42 2.29 4.81 6
5 Australia AU 77 1.42 8.36 1.71
6 Austria AT 77 1.29 8.09 1.60
Initialize H2O for ML
# Convert data to H2O frame, removing non-numeric columns
country_risk_h2o <- as.h2o(
country_risk
)
# Build PCA model
pca_h2o <- h2o.prcomp(
training_frame = country_risk_h2o,
x = c("corruption", "peace", "legal", "gdp_growth"),
  k = 4, # number of principal components (in this case, p = q)
transform = "STANDARDIZE" # center & scale data
)
To generate PCs, simply make predictions with the PCA model.
The model summary provides details of variance explained.
Model Details:
==============
H2ODimReductionModel: pca
Model ID: PCA_model_R_1771945215033_165
Importance of components:
pc1 pc2 pc3 pc4
Standard deviation 1.600254 1.001183 0.614453 0.243450
Proportion of Variance 0.640203 0.250592 0.094388 0.014817
Cumulative Proportion 0.640203 0.890795 0.985183 1.000000
H2ODimReductionMetrics: pca
No model metrics available for PCA
To see the factor loadings for each PC, we need to pull the eigenvalues and eigenvectors.
mtcars data
With the mtcars data,
Perform PCA and report explained variances of each principal component.
Reduce dimensions into 2 using PCA.
Report Principal Component dataframe with 2 columns
How many principal components are needed to explain at least 95% of the variation?
John C. Hull “Machine Learning in Business”
FIN6776: Big Data and Machine Learning in Finance